{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AIVmAXwA_1kz"
      },
      "source": [
        "## Evaluating the Spread of AI-Generated Synthetic Media on X\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "in0v7TiL_6R9"
      },
      "source": [
        "## 1. Prepare Environment and Load Data"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "AGyXyb2xqtc3"
      },
      "outputs": [],
      "source": [
        "from google.colab import drive\n",
        "import pandas as pd\n",
        "from datetime import datetime\n",
        "drive.mount('/content/drive')\n",
        "\n",
        "DATA_LOC = \"LOC/community-notes.tsv\"\n",
        "communitynotes_data = pd.read_csv(DATA_LOC, delimiter = '\\t')\n",
        "\n",
        "# Keep a datetime Series for filtering and a day/month/year string column for display\n",
        "created_at = pd.to_datetime(communitynotes_data['createdAtMillis'], unit='ms').dt.normalize()\n",
        "communitynotes_data[\"date\"] = created_at.dt.strftime('%d/%m/%Y')\n",
        "\n",
        "# Use explicit timestamps: a bare '01-11-2022' string is parsed month-first (11 January 2022).\n",
        "# Assumes the intended window is 1 November 2022 to 30 September 2023 (both bounds exclusive).\n",
        "in_window = (created_at > pd.Timestamp('2022-11-01')) & (created_at < pd.Timestamp('2023-09-30'))\n",
        "communitynotes_data = communitynotes_data[in_window]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9mi9uUZE_-J3"
      },
      "source": [
        "## 2. Extract Community Notes by Keyword and Save Data as Dictionary"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "uOEVKryvPcCG"
      },
      "source": [
        "This query selects Community Notes whose summary mentions potentially misleading AI-generated visual content, such as deepfakes or AI-generated images. It deliberately omits strings that signal openly labeled AI content, such as the hashtag #aiart, so that it captures only cases where the audience could be misled by the absence of clear labeling."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_LElJglrQOrw"
      },
      "outputs": [],
      "source": [
        "import pandas as pd\n",
        "import pickle\n",
        "\n",
        "def filter_dataframe(full_df, expressions):\n",
        "    # Combine all expressions into one case-insensitive alternation\n",
        "    pattern = '|'.join(expressions)\n",
        "    new_df = full_df[full_df['summary'].str.contains(pattern, case=False, na=False)]\n",
        "    return new_df\n",
        "\n",
        "# Raw strings keep regex escapes such as \\s intact and avoid invalid-escape warnings\n",
        "expressions = [r\"ai-generated(?:\\s+image|\\s+video|\\s+art|\\s+photo|\\s+deepfake)\",\n",
        "               r\"ai generated(?:\\s+image|\\s+video|\\s+art|\\s+photo|\\s+deepfake)\",\n",
        "               r\"ai(?:\\s+image|\\s+video|\\s+art|\\s+photo|\\s+deepfake)\",\n",
        "               r\"ai-(?:\\s+image|\\s+video|\\s+art|\\s+photo|\\s+deepfake)\",\n",
        "               r\"generated\\s+(?:with|by)\\s+(?:ai|artificial intelligence)\",\n",
        "               \"midjourney\", \"stable diffusion\", \"dall-e\", \"deepfake\",\n",
        "               \"deep fake\", \"deepfaked\"]\n",
        "\n",
        "visual_disinformation = filter_dataframe(communitynotes_data, expressions)\n",
        "no_visual_disinformation = communitynotes_data.drop(visual_disinformation.index)\n",
        "\n",
        "community_notes = {'visual': visual_disinformation, 'no_visual': no_visual_disinformation}\n",
        "\n",
        "with open('community_notes.pkl', 'wb') as f:\n",
        "    pickle.dump(community_notes, f)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Kp8FP5o9UgmD"
      },
      "outputs": [],
      "source": [
        "community_notes['visual'].shape[0]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "U0u33E7sRF6-"
      },
      "source": [
        "## 3. Collect Data From Source Tweets"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VjZRlbrojhyR"
      },
      "source": [
        "Here, we use Selenium with headless Chrome to collect data from the tweets that the filtered community notes refer to. The scraped fields include usernames, tweet dates, view counts, reposts, quotes, likes, bookmarks, and the tweet text; follower counts are collected separately in Section 3.3. The pipeline also downloads the first image embedded in each tweet and saves it to a Google Drive folder for subsequent analysis. Because of Twitter's dynamic content loading, the current implementation cannot capture video content."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "15ImLbF5m_ex"
      },
      "source": [
        "### 3.1 Prepare Environment and Create Chrome Driver for Selenium"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "U9G-MWOimUfp"
      },
      "outputs": [],
      "source": [
        "!pip install selenium\n",
        "from selenium import webdriver\n",
        "from selenium.webdriver.common.by import By\n",
        "import time\n",
        "import os\n",
        "import re\n",
        "import pandas as pd\n",
        "import urllib.request\n",
        "from google.colab import drive\n",
        "\n",
        "def create_chrome_driver():\n",
        "    chrome_options = webdriver.ChromeOptions()\n",
        "    chrome_options.add_argument('--headless')\n",
        "    chrome_options.add_argument('--no-sandbox')\n",
        "    chrome_options.add_argument('--disable-dev-shm-usage')\n",
        "    return webdriver.Chrome(options=chrome_options)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SAzaaDHvnDxE"
      },
      "source": [
        "### 3.2 Run the Main Data Collection Pipeline"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "y8oBB9s8f1XC"
      },
      "outputs": [],
      "source": [
        "def generate_tweet_url(tweet_id):\n",
        "    return f\"https://twitter.com/anyuser/status/{tweet_id}\"\n",
        "\n",
        "def collect_tweet_data(source_tweets, OUTPUT_DIR_IMAGES, OUTPUT_FILE_INTERMEDIATE):\n",
        "    drive.mount('/content/drive')\n",
        "    driver = create_chrome_driver()\n",
        "    source_tweets['tweetUrl'] = source_tweets['tweetId'].apply(generate_tweet_url)\n",
        "\n",
        "    patterns = {\n",
        "        'views': re.compile(r'([\\d,.]+[KkMm]?)\\s*\\n\\s*Views'),\n",
        "        'reposts': re.compile(r'([\\d,.]+[KkMm]?)\\s*\\n\\s*Reposts'),\n",
        "        'quotes': re.compile(r'([\\d,.]+[KkMm]?)\\s*\\n\\s*Quotes'),\n",
        "        'likes': re.compile(r'([\\d,.]+[KkMm]?)\\s*\\n\\s*Likes'),\n",
        "        'bookmarks': re.compile(r'([\\d,.]+[KkMm]?)\\s*\\n\\s*Bookmarks')\n",
        "    }\n",
        "\n",
        "    df = pd.DataFrame(columns=['tweetId', 'tweetDate', 'username', 'count', 'tweetUrl', 'imageUrl', 'views', 'reposts', 'quotes', 'likes', 'bookmarks', 'text'])\n",
        "    os.makedirs(OUTPUT_DIR_IMAGES, exist_ok=True)\n",
        "\n",
        "    for i, row in source_tweets.iterrows():\n",
        "        tweet_id = row['tweetId']\n",
        "        tweet_url = row['tweetUrl']\n",
        "        print(f\"Processing tweet ID: {tweet_id}\")\n",
        "\n",
        "        driver.get(tweet_url)\n",
        "        time.sleep(15)\n",
        "\n",
        "        new_row = {'tweetId': tweet_id, 'count': row['count'], 'tweetUrl': tweet_url}\n",
        "\n",
        "        try:\n",
        "            new_row['tweetDate'] = driver.find_element(By.XPATH, '//time').text\n",
        "            new_row['username'] = driver.find_elements(By.CSS_SELECTOR, 'div a')[6].text\n",
        "            full_text = driver.find_elements(By.CSS_SELECTOR, 'div div')[0].text\n",
        "            new_row['text'] = driver.find_elements(By.CSS_SELECTOR, 'div span')[18].text\n",
        "            images = driver.find_elements(By.XPATH, '//img[@alt=\"Image\"]')\n",
        "            if images:\n",
        "                new_row['imageUrl'] = images[0].get_attribute('src')\n",
        "                urllib.request.urlretrieve(new_row['imageUrl'], f\"{OUTPUT_DIR_IMAGES}/image_{tweet_id}.jpg\")\n",
        "            else:\n",
        "                new_row['imageUrl'] = None\n",
        "\n",
        "            for key, pattern in patterns.items():\n",
        "                match = pattern.search(full_text)\n",
        "                new_row[key] = match.group(1) if match else None\n",
        "\n",
        "        except Exception as e:\n",
        "            print(f\"An error occurred: {e}\")\n",
        "            for key in ['tweetDate', 'username', 'text', 'imageUrl', 'views', 'reposts', 'quotes', 'likes', 'bookmarks']:\n",
        "                new_row[key] = \"TWEET-NOT-FOUND\"\n",
        "\n",
        "        df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)\n",
        "        df.to_csv(OUTPUT_FILE_INTERMEDIATE, index=False)\n",
        "\n",
        "        if (i + 1) % 40 == 0:\n",
        "            print(\"Pausing for 10 minutes after processing 40 tweets.\")\n",
        "            time.sleep(600)\n",
        "\n",
        "    driver.quit()\n",
        "    return df"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "vWMxmXcLGLFs"
      },
      "outputs": [],
      "source": [
        "OUTPUT_DIR_IMAGES = \"LOC/Def\"\n",
        "OUTPUT_FILE_INTERMEDIATE = \"LOC/Def\"\n",
        "\n",
        "# NOTE: `unique_to_visual` is not defined in this notebook (previous names were all source_tweets).\n",
        "# It must be a DataFrame with 'tweetId' and 'count' columns.\n",
        "hydrated_df = collect_tweet_data(unique_to_visual, OUTPUT_DIR_IMAGES, OUTPUT_FILE_INTERMEDIATE)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "gbgjhe7e8kZ_"
      },
      "outputs": [],
      "source": [
        "OUTPUT_DIR_IMAGES = \"LOC/Def\"\n",
        "OUTPUT_FILE_INTERMEDIATE = \"LOC/Def\"\n",
        "\n",
        "source_tweets = community_notes['visual']['tweetId'].value_counts().reset_index()\n",
        "source_tweets.columns = ['tweetId', 'count']\n",
        "\n",
        "hydrated_df = collect_tweet_data(source_tweets,OUTPUT_DIR_IMAGES,OUTPUT_FILE_INTERMEDIATE)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yOUtMDzXz2xY"
      },
      "source": [
        "### 3.3 Collect Data on Follower Counts"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "V3eHAFc1lrnx"
      },
      "outputs": [],
      "source": [
        "def collect_followers_count(df, OUTPUT_FILE_USERS):\n",
        "    driver = create_chrome_driver()\n",
        "    df['userFollowers'] = None  # Initialize the userFollowers column\n",
        "\n",
        "    for i, row in df.iterrows():\n",
        "        username = row.get('username', None)\n",
        "\n",
        "        if username is None or username == 'TWEET-NOT-FOUND':\n",
        "            print(\"Skipping row with missing or TWEET-NOT-FOUND username.\")\n",
        "            df.at[i, 'userFollowers'] = 'TWEET-NOT-FOUND'\n",
        "            continue\n",
        "\n",
        "        username = username.lstrip('@')\n",
        "        url = f\"https://twitter.com/{username}\"\n",
        "        print(f\"Collecting followers count for: {username}\")\n",
        "\n",
        "        driver.get(url)\n",
        "        time.sleep(15)\n",
        "\n",
        "        try:\n",
        "            # The follower count is the span immediately preceding the 'Followers' label\n",
        "            all_span_elements = driver.find_elements(By.CSS_SELECTOR, 'span span')\n",
        "            for j, element in enumerate(all_span_elements):\n",
        "                if element.text == 'Followers':\n",
        "                    follower_count = all_span_elements[j - 1].text\n",
        "                    df.at[i, 'userFollowers'] = follower_count\n",
        "                    print(f\"Follower count for {username}: {follower_count}\")\n",
        "                    break\n",
        "\n",
        "        except Exception as e:\n",
        "            print(f\"An error occurred while collecting followers count for {username}: {e}\")\n",
        "\n",
        "        if (i + 1) % 40 == 0:\n",
        "            print(\"Saving intermediate data.\")\n",
        "            df.to_csv(OUTPUT_FILE_USERS, index=False)\n",
        "            print(\"Pausing for 10 minutes.\")\n",
        "            time.sleep(600)\n",
        "\n",
        "    driver.quit()\n",
        "    df.to_csv(OUTPUT_FILE_USERS, index=False)\n",
        "    # Return the DataFrame so the caller's assignment does not receive None\n",
        "    return df"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "x4c9id4u-XVv"
      },
      "outputs": [],
      "source": [
        "OUTPUT_FILE_USERS = \"/LOC/Def\"\n",
        "\n",
        "hydrated_df_users = collect_followers_count(hydrated_df,OUTPUT_FILE_USERS)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7418jgkXYEtn"
      },
      "source": [
        "### 3.4 Clean Data"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yFgGL0Y3kres"
      },
      "source": [
        "Finally, we clean the data before analysis. Engagement and follower counts are normalized to a consistent abbreviated notation (for example, 1.2k or 2.5m), tweet dates are converted to day/month/year format, and the `count` column is renamed to `notesCount`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "UModBA_NekVN"
      },
      "outputs": [],
      "source": [
        "from datetime import datetime\n",
        "\n",
        "def clean_dataframe(df):\n",
        "    current_year = \"2023\"\n",
        "\n",
        "    def transform_value(x):\n",
        "        # After astype(str), missing values arrive as the strings 'nan' or 'None'\n",
        "        if pd.isna(x) or str(x).strip().lower() in ('', 'nan', 'none'): return '0'\n",
        "        try:\n",
        "            s = str(x).strip().upper()  # accept lowercase 'k'/'m' suffixes as well\n",
        "            num = float(s.replace('M', '').replace('K', '').replace(',', ''))\n",
        "            num *= 1e6 if 'M' in s else 1e3 if 'K' in s else 1\n",
        "            if num < 1e3: return str(int(num))\n",
        "            value, suffix = (num / 1e6, 'm') if num >= 1e6 else (num / 1e3, 'k')\n",
        "            # Trim a trailing '.0' before appending the suffix, e.g. 2000 -> '2k'\n",
        "            return f\"{value:.1f}\".rstrip('0').rstrip('.') + suffix\n",
        "        except ValueError: return x\n",
        "\n",
        "    def transform_date(d):\n",
        "        if not isinstance(d, str): return None\n",
        "        try: return datetime.strptime(d, '%I:%M %p · %b %d, %Y').strftime('%d/%m/%Y')\n",
        "        except ValueError: pass\n",
        "        try: return datetime.strptime(d, '%b %d, %Y').strftime('%d/%m/%Y')\n",
        "        except ValueError: pass\n",
        "        # Dates shown without a year are assumed to fall in current_year\n",
        "        try: return datetime.strptime(f\"{d}, {current_year}\", '%b %d, %Y').strftime('%d/%m/%Y')\n",
        "        except ValueError: return None\n",
        "\n",
        "    for col in ['views', 'reposts', 'quotes', 'likes', 'bookmarks', 'userFollowers']:\n",
        "        df[col] = df[col].astype(str).apply(transform_value)\n",
        "\n",
        "    df['tweetDate'] = df['tweetDate'].apply(transform_date)\n",
        "    df.rename(columns={'count': 'notesCount'}, inplace=True)\n",
        "\n",
        "    return df"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "-b4CeKIGek2U"
      },
      "outputs": [],
      "source": [
        "OUTPUT_FILE_FINAL = \"LOC/Def\"\n",
        "processed_df = clean_dataframe(hydrated_df)\n",
        "processed_df.to_csv(OUTPUT_FILE_FINAL, index=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9Dt5sofLKnOY"
      },
      "source": [
        "## 4. Validation through Krippendorff's Alpha"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "KL8gfbNTKpTL",
        "outputId": "c4b93456-9628-47d1-c6f0-2cebc972b483"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "3 validation CSV files have been created.\n"
          ]
        }
      ],
      "source": [
        "import numpy as np\n",
        "for i in range(1, 4):\n",
        "    # replace=False prevents duplicate tweets within a sample. Each validator file uses a different\n",
        "    # seed, so the three samples can differ; agreement is later computed on their common tweets.\n",
        "    sample = annotated_data[['tweetId', 'tweetUrl']].sample(n=30, replace=False, random_state=i)\n",
        "    # Add empty columns for 'verified', 'political', and 'media'\n",
        "    sample['verified'] = np.nan\n",
        "    sample['political'] = np.nan\n",
        "    sample['media'] = np.nan\n",
        "\n",
        "    # Export to CSV\n",
        "    sample.to_csv(f'validation_{i}.csv', index=False)\n",
        "\n",
        "print(\"3 validation CSV files have been created.\")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "xAihFoyteTJI",
        "outputId": "a9ead9da-6f11-4d35-c83f-d226c1b6d4bf"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Columns in file1: Index(['tweetId', 'tweetUrl', 'verified', 'political', 'media'], dtype='object')\n",
            "Columns in file2: Index(['tweetId', 'tweetUrl', 'verified', 'political', 'media'], dtype='object')\n",
            "Columns in file3: Index(['tweetId', 'tweetUrl', 'verified', 'political', 'media'], dtype='object')\n",
            "Krippendorff's Alpha for 'verified': 0.8036945133719328\n",
            "Krippendorff's Alpha for 'political': 0.8910546139359699\n",
            "Krippendorff's Alpha for 'media': 0.7931568754034861\n"
          ]
        }
      ],
      "source": [
        "import pandas as pd\n",
        "import krippendorff\n",
        "\n",
        "def calculate_krippendorff_alpha(file1, file2, file3):\n",
        "    def convert_column(column):\n",
        "        # Define a mapping for standard categories and special cases;\n",
        "        # labels missing from the mapping become NaN and are treated as missing ratings\n",
        "        mapping = {\n",
        "            'TRUE': 1, 'FALSE': 0,\n",
        "            'POLITICAL': 1, 'NON-POLITICAL': 0,\n",
        "            'IMAGE': 1, 'VIDEO': 0,\n",
        "            'NON-AI-GEN': 2, 'REMOVED': 3, 'NO-DATA': 4,\n",
        "            'NO-MEDIA': 5\n",
        "        }\n",
        "        return column.map(mapping)\n",
        "\n",
        "    def calculate_alpha(data1, data2, data3, column_name):\n",
        "        # Convert the categorical data into numerical codes (one row per validator)\n",
        "        combined_data = [\n",
        "            convert_column(data1[column_name].copy()),\n",
        "            convert_column(data2[column_name].copy()),\n",
        "            convert_column(data3[column_name].copy())\n",
        "        ]\n",
        "\n",
        "        # Calculate and return Krippendorff's Alpha.\n",
        "        # Note: krippendorff.alpha defaults to level_of_measurement='interval',\n",
        "        # although the codes above are categorical labels.\n",
        "        return krippendorff.alpha(reliability_data=combined_data)\n",
        "\n",
        "    # Read the data\n",
        "    df1 = pd.read_csv(file1)\n",
        "    df2 = pd.read_csv(file2)\n",
        "    df3 = pd.read_csv(file3)\n",
        "\n",
        "    # Debugging: Print the column names to check\n",
        "    print(\"Columns in file1:\", df1.columns)\n",
        "    print(\"Columns in file2:\", df2.columns)\n",
        "    print(\"Columns in file3:\", df3.columns)\n",
        "\n",
        "    # Align the data based on 'tweetId'\n",
        "    df1 = df1.set_index('tweetId')\n",
        "    df2 = df2.set_index('tweetId')\n",
        "    df3 = df3.set_index('tweetId')\n",
        "\n",
        "    # Reindex the dataframes to align them\n",
        "    common_index = df1.index.intersection(df2.index).intersection(df3.index)\n",
        "    df1 = df1.reindex(index=common_index)\n",
        "    df2 = df2.reindex(index=common_index)\n",
        "    df3 = df3.reindex(index=common_index)\n",
        "\n",
        "    # Calculate Krippendorff's Alpha for each category\n",
        "    alpha_verified = calculate_alpha(df1, df2, df3, 'verified')\n",
        "    alpha_political = calculate_alpha(df1, df2, df3, 'political')\n",
        "    alpha_media = calculate_alpha(df1, df2, df3, 'media')\n",
        "\n",
        "    # Return the results\n",
        "    return alpha_verified, alpha_political, alpha_media\n",
        "\n",
        "# Usage example:\n",
        "alpha_values = calculate_krippendorff_alpha('validation_1.csv', 'validation_2.csv', 'validation_3.csv')\n",
        "print(\"Krippendorff's Alpha for 'verified':\", alpha_values[0])\n",
        "print(\"Krippendorff's Alpha for 'political':\", alpha_values[1])\n",
        "print(\"Krippendorff's Alpha for 'media':\", alpha_values[2])\n"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "machine_shape": "hm",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
